In [81]:
!pip install cufflinks
In [11]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
plt.style.use('ggplot')
import seaborn as sns # for making plots with seaborn
color = sns.color_palette()
sns.set(rc={'figure.figsize':(25,15)})
import plotly
plotly.offline.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
#import cufflinks as cf
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
In [164]:
df = pd.read_csv('./db/googleplaystore.csv')
According to the description of the dataset and of its columns, we have a dataset with information about the Google Play Store. We could apply many algorithms to it, since it offers a good variety of instances, more than 10k, and 13 features.
Our consulting firm has received a request to discover the kind of mobile app on which our client should concentrate efforts to maximize the return on investment. As a first glimpse, we considered data from the Google Play platform for this study, since it represents the lower break-even point among mobile products.
Therefore, we acquired a dataset with more than ten thousand instances and more than ten features from the Google Play platform to analyse what the next killer app could be.
The dataset is a collection of Google Play Store entries with a good variety of information, such as the number of installations, the genre of the application and the reviews, making 13 features in total for about eleven thousand apps from the mobile platform.
So, let's look at the data we have in the dataset by taking a sample below:
In [165]:
df.sample(10)
Out[165]:
We can check that our dataset has 10,841 tuples
In [166]:
len(df)
Out[166]:
and 13 columns/features
In [167]:
df.columns
Out[167]:
Where each column/feature represents:
- App - the name of the application
- Category - the category of the application
- Rating - the rating given by the users
- Reviews - the number of reviews given by the users
- Size - the size of the application
- Installs - the number of installs
- Type - whether the application is paid or free
- Price - the price charged
- Content Rating - the age rating
- Genres - the genre(s) of the application
- Last Updated - the date when it was last updated
- Current Ver - the current (latest) version number
- Android Ver - the minimum compatible Android version

Now let's see what each dataset feature presents in terms of its data, such as the range of values and the categories it contains.
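A quick way to get that overview (an illustrative sketch; the calls below are just one possible way to inspect ranges and categories):
In [ ]:
# data type of each column
print(df.dtypes)
# number of distinct values per column
print(df.nunique())
# a peek at the distinct categories, for example
print(df['Category'].unique())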
In this section, we adjust the data in our dataset. Why do we need to do that? Because datasets usually come with problems, such as missing values in some tuples or data that makes no sense (like 100 sons, or a last-updated date in 1900 when the data collection happened after 2000), and those problems can make the results go wrong.
So what do we need to do? The first steps follow the assignment description, but we also need to put some extra effort into this dataset, as shown in the cleaning cell below:
In [168]:
# drop duplicate apps, keeping the first occurrence
df.drop_duplicates(subset='App', inplace=True)
# drop rows with a missing 'Android Ver' and shifted rows where 'Installs' holds a Type value
df = df[df['Android Ver'].notna()]
df = df[df['Android Ver'] != 'NaN']
df = df[df['Installs'] != 'Free']
df = df[df['Installs'] != 'Paid']
# make 'Installs' numeric: strip the '+' suffix and the thousands separators, then cast to float
df['Installs'] = df['Installs'].apply(lambda x: x.replace('+', '') if '+' in str(x) else x)
df['Installs'] = df['Installs'].apply(lambda x: x.replace(',', '') if ',' in str(x) else x)
df['Installs'] = df['Installs'].apply(lambda x: float(x))
# make 'Size' numeric (in MB): 'Varies with device' becomes NaN, 'M' is stripped, 'k' values are divided by 1000
df['Size'] = df['Size'].apply(lambda x: str(x).replace('Varies with device', 'NaN') if 'Varies with device' in str(x) else x)
df['Size'] = df['Size'].apply(lambda x: str(x).replace(',', '') if ',' in str(x) else x)
df['Size'] = df['Size'].apply(lambda x: str(x).replace('M', '') if 'M' in str(x) else x)
df['Size'] = df['Size'].apply(lambda x: float(str(x).replace('k', '')) / 1000 if 'k' in str(x) else x)
df['Size'] = df['Size'].apply(lambda x: float(x))
# make 'Reviews' an int and 'Price' a float (stripping the '$' sign)
df['Reviews'] = df['Reviews'].apply(lambda x: int(x))
df['Price'] = df['Price'].apply(lambda x: str(x).replace('$', '') if '$' in str(x) else str(x))
df['Price'] = df['Price'].apply(lambda x: float(x))
Analyzing the amount of missing data, we have:
In [169]:
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(6)
Out[169]:
We need to act on this by dropping all instances with missing values from our dataset:
In [170]:
df.dropna(how ='any', inplace = True)
Checking the result of this operation, we end up with fewer instances:
In [171]:
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(6)
Out[171]:
In [172]:
print(len(df))
In [173]:
10841 - 7021
Out[173]:
In [174]:
df["Price"].describe()
Out[174]:
In [175]:
x = df['Rating'].dropna()
z = df['Installs'][df.Installs!=0].dropna()
p = df['Reviews'][df.Reviews!=0].dropna()
t = df['Type'].dropna()
price = df['Price']
p = sns.pairplot(pd.DataFrame(list(zip(x, np.log(z), np.log10(p), t, price)),
columns=['Rating', 'Installs', 'Reviews', 'Type', 'Price']), hue='Type', palette="Set2")
In [74]:
print("\n", df['Category'].unique())
In [176]:
print(df["Genres"].unique())
The feature 'Price' is one of the columns that needs some normalization. As we can see below, we have a mean of $1.17 per app and the data is very skewed. Below we plot the distribution of this feature.
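As a quick numeric check of that skewness (an illustrative snippet, not part of the original analysis), the sample skewness and upper quantiles can be read directly from pandas:
In [ ]:
# sample skewness of Price; a value far above 0 indicates a heavy right tail
print(df['Price'].skew())
# most apps are free, so the upper quantiles show where the tail begins
print(df['Price'].quantile([0.5, 0.9, 0.99]))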
In [177]:
from scipy import stats
sns.distplot(df["Price"], kde=False, fit=stats.norm);
In [89]:
df["Price"].describe()
Out[89]:
In [178]:
# note: this creates a reference to df, not an independent copy (df.copy() would keep the original untouched)
dfmod = df
# min-max normalization of 'Price' to the [0, 1] range
dfmod["Price"] = (df["Price"] - df["Price"].min()) / (df["Price"].max() - df["Price"].min())
In [179]:
dfmod["Price"].describe()
Out[179]:
So, as we can see below, we now have normalized price data in the dfmod DataFrame, and it follows the same histogram shape. Like Price, there are other features we need to check, such as the number of installs, the rating, the reviews, and the size.
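For reference, the normalization used here and in the next cells is the standard min-max scaling, $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$, which maps the smallest value to 0 and the largest to 1 while preserving the shape of the distribution.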
In [180]:
sns.distplot(dfmod["Price"], kde=False, fit=stats.norm);
Regarding 'Installs', the information comes categorized: the number of downloads is divided into chunks (buckets such as 1,000+ and 10,000+), and we normalize it below.
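Before normalizing, a quick look at those buckets (an illustrative check, not in the original notebook):
In [ ]:
# distinct install buckets and how many apps fall into each one
print(df['Installs'].value_counts().sort_index())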
In [181]:
%%capture
dfmod["Installs"] = (df["Installs"] - df["Installs"].min()) / (df["Installs"].max() - df["Installs"].min())
Let's see how the distribution of the rating values looks. We have a really high mean of 4.16 in the ratings data and a standard deviation of around 0.56. Notice that the minimum rating is 1 and that the first quartile is already around 4; in other words, we have skewed data with a long left tail. We will normalize this feature as well, further below.
In [182]:
print(dfmod["Rating"].describe())
sns.distplot(df["Rating"], kde=False, fit=stats.norm);
In [158]:
sns.distplot(df["Installs"], rug=True, hist=False)
Out[158]:
If we are going to use a classification method, we need a label to classify on. For this purpose we will define one based on the distribution of the rating, proposing the following bins:
In [183]:
print(df["Rating"].describe())
print("\n", df['Rating'].unique())
sns.distplot(df["Rating"], hist=True)
dfmod.loc[(dfmod['Rating'] >= 0.0 ) & (dfmod['Rating'] <= 4.25 ), 'label_rating'] = '0 bad'
dfmod.loc[(dfmod['Rating'] > 4.25 ) & (dfmod['Rating'] <= 4.75 ), 'label_rating'] = '1 normal'
dfmod.loc[(dfmod['Rating'] > 4.75), 'label_rating'] = '2 good'
print(dfmod['label_rating'].unique())
Now that we have categorized the new label, we are ready to normalize the rating data. Remember that we are normalizing the data to the 0-1 range.
In [185]:
dfmod["Rating"] = (df["Rating"] - df["Rating"].min()) / (df["Rating"].max() - df["Rating"].min())
sns.distplot(dfmod["Rating"], hist=True)
Out[185]:
We should also normalize the installs, reviews and price data:
In [192]:
dfmod["Installs"] = (df["Installs"] - df["Installs"].min()) / (df["Installs"].max() - df["Installs"].min())
dfmod["Price"] = (df["Price"] - df["Price"].min()) / (df["Price"].max() - df["Price"].min())
dfmod["Reviews"] = (df["Reviews"] - df["Reviews"].min()) / (df["Reviews"].max() - df["Reviews"].min())
dfmod[['Rating', 'Installs', 'Price', 'Reviews', 'label_rating' ]]
Out[192]:
In [198]:
#sns.dfmod[['Rating', 'Installs', 'Price', 'Reviews', 'label_rating' ]]
#sns.pairplot(data = dfmod[['Rating', 'Installs', 'Price', 'Reviews', 'label_rating' ]])
x = dfmod['Rating']
z = dfmod['Installs']
p = dfmod['Reviews'][df.Reviews!=0].dropna()
t = dfmod['label_rating'].dropna()
price = df['Price']
p = sns.pairplot(pd.DataFrame(list(zip(x, z, p, t, price)),
columns=['Rating', 'Installs', 'Reviews', 'label_rating', 'Price']), hue='label_rating', palette="Set2")
What the pair plot above shows us is how the mobile market works. The first plot that we noticed was the $rating \times installs$ chart: apps with the highest ratings tend to have far fewer installs. The same behaviour can be observed in the $rating \times reviews$ chart. The third observation is about the $rating \times price$ chart, where we can conclude that there is no highly priced app among the 'good' (highest-rated) ones.
Observing the second row, third and fourth columns, we can see that the number of reviews stays very low regardless of whether the app is labelled bad, normal or good, which also indicates a poor correlation between these two features. The $installs \times price$ chart shows that there is no good paid app with a high number of installs, and that some normal and bad apps are priced way above the mean price of the Google Play Store.
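To back the 'poor correlation' observation with a number (an illustrative check on the normalized dfmod frame, not part of the original notebook), we can compute the pairwise Pearson correlations directly:
In [ ]:
# Pearson correlation between the normalized numeric features used in the pair plot
print(dfmod[['Rating', 'Installs', 'Reviews', 'Price']].corr())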
As we are dealing with a descriptive dataset, we are going to explore unsupervised methods in this section, namely the k-means algorithm and Agglomerative Clustering.
k-means is an algorithm aimed at the class of problems called unsupervised problems. This means that, from the data itself, we try to divide the dataset into clusters. So we do not need labels, only the number of clusters k. As in numerous other methods, k-means works by minimizing a cost function, and this kind of algorithm is susceptible to local minima.
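Concretely, the cost function minimized by k-means is the within-cluster sum of squares, $J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$, where $C_i$ is the $i$-th cluster and $\mu_i$ its centroid; different initializations can converge to different local minima of $J$.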
As proposed in the assignment description, we should consider a minimum k of two and a maximum of $\lceil \log_2{n} \rceil$, where $n$ is the number of instances in our dataset.
After the pre-processing section we ended up with around 7k instances, which gives approximately $\lceil \log_2{7000} \rceil = \lceil 12.77 \rceil = 13$ as the maximum number of clusters $k$ to use in our experiments.
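A quick sanity check of this bound on the cleaned DataFrame (an illustrative snippet, not part of the original notebook):
In [ ]:
import math
# upper bound for k: ceil(log2(n)) on the cleaned dataset
print(len(df), math.ceil(math.log2(len(df))))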
So, making use of scikit-learn library we could code the experimentation with the k-means method using:
In [249]:
from sklearn.cluster import KMeans
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, cross_val_predict, KFold
from sklearn import preprocessing, metrics
from sklearn.cluster import AgglomerativeClustering
from time import time
The k-means implementation in scikit-learn uses Lloyd's or Elkan's algorithm [http://www.vlfeat.org/api/kmeans-fundamentals.html]. It has a complexity of O(k n T), where k is the number of clusters, n is the number of samples and T is the number of iterations of the algorithm.
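As a rough order-of-magnitude check (illustrative numbers only): with $n \approx 7000$, $k = 12$ and the default limit of $T = 300$ iterations, a single run performs at most about $7000 \times 12 \times 300 \approx 2.5 \times 10^{7}$ point-to-centroid distance evaluations.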
For each k from 3 to 12, the algorithm is run with the 3 different seeds asked for by the assignment description (values 37, 110 and 777), giving 30 k-means runs in total. We use 4 parallel jobs (n_jobs) to execute the experiments; each run uses the default limit of 300 iterations and the k-means++ initialization.
In [250]:
seed_initialization = [37, 110, 777]
n_jobs = 4
sample_size_ = 300  # sample size used by silhouette_score
dataclust = dfmod[['Rating', 'Installs', 'Price', 'Reviews']]
table_results = []

# k-means: k from 3 to 12, each with the 3 seeds required by the assignment
for i in range(3, 13):
    for y in range(3):
        t0 = time()
        estimator = KMeans(init='k-means++', n_clusters=i, random_state=seed_initialization[y], n_jobs=n_jobs)
        estimator.fit(dataclust)
        print(time() - t0)
        print("k: " + str(i))
        print("seed: " + str(seed_initialization[y]))
        # homogeneity against the label_rating classes and silhouette on a sample of the data
        db = metrics.homogeneity_score(dfmod['label_rating'], estimator.labels_)
        si = metrics.silhouette_score(dataclust, estimator.labels_, metric='euclidean', sample_size=sample_size_)
        # use a list (not a set) so the column order is preserved in the results table
        result_estimator = ["kmeans", i, seed_initialization[y], db, si]
        table_results.append(result_estimator)

# agglomerative clustering run once, with a fixed number of clusters
estimator2 = AgglomerativeClustering(n_clusters=6, linkage='ward').fit(dataclust)
db2 = metrics.homogeneity_score(dfmod['label_rating'], estimator2.labels_)
si2 = metrics.silhouette_score(dataclust, estimator2.labels_, metric='euclidean', sample_size=sample_size_)
result_estimator = ["AgglomerativeClustering", 6, "-", db2, si2]
table_results.append(result_estimator)
All the metrics were collected in the same cell as the clustering runs above, so we can now show the results:
In [251]:
result = pd.DataFrame(table_results, columns=["Algorithm", "K", "Seed", "Homogeneity", "Silhouette"])
In [252]:
result
Out[252]:
The Friedman test is a version of the repeated-measures analysis of variance (ANOVA) that can be executed on ranked data.
So, to run a Friedman's ANOVA, we should:
1. Define the null and alternative hypotheses: $H_0$: there is no difference between the three conditions; $H_1$: there is a difference between the three conditions.
2. State alpha: $\alpha = 0.05$.
3. Calculate the degrees of freedom: $df = k - 1 = 3 - 1 = 2$.
4. State the decision rule: in this step we usually use the significance levels 0.050, 0.010 and 0.001.
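As a programmatic cross-check of the online calculator used below (a sketch, assuming the result table built above, with the seeds as conditions and the values of k as blocks), the same test can be run with scipy:
In [ ]:
from scipy.stats import friedmanchisquare
# pivot the k-means silhouette scores so each seed becomes one 'condition'
kmeans_res = result[result['Algorithm'] == 'kmeans']
pivot = kmeans_res.pivot(index='K', columns='Seed', values='Silhouette')
stat, p_value = friedmanchisquare(pivot[37], pivot[110], pivot[777])
print(stat, p_value)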
Using the website indicated by the professor, https://www.socscistatistics.com/tests/friedman/Default.aspx, we ran the test with the Silhouette metric of the different algorithms and seeds; the results are shown in the following images.
So, according to the test presented here, there is no significant difference between the two models (or parametrized models) that we used in this experiment, meaning that either of them should bring equivalent results. Tuning the models would probably yield better results, but that tuning is left for future work.